earned patent term adjustment. See 37 CFR 1.704(b).

DETAILED ACTION

1. Claims 1-7 and 9-35 have been examined. 

Papers Submitted 

2. It is hereby acknowledged that the following papers have been received and placed of 
record in the file: Amendment as received on 7/20/2004. 

Specification 

3. The disclosure is objected to because of the following informalities: On page 11,
paragraph [0035], line 2, replace "170" with --710--. On page 11, paragraph [0036], last line,
replace "continues at A" with --continues at block 710--.

Appropriate correction is required. 

Drawings 

4. The drawings are objected to because of the following minor informalities: In Fig. 6,
component 100 should be labeled as "Multiprocessor System" or something along those lines.
Also, in Fig. 9, it appears the "no" should be moved from the left side of the decisional block
labeled "Clean Load Matched Any Tag" to the bottom of it, where the arrow is emanating from.
Also, for the decisional step labeled "Load Matching Tag And Store Valid Bit "0"?", the 
examiner believes that a "no" arrow should emanate from that block. 

Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to 
the Office action to avoid abandonment of the application. Any amended replacement drawing
sheet should include all of the figures appearing on the immediate prior version of the sheet,
even if only one figure is being amended. The figure or figure number of an amended drawing 
should not be labeled as "amended." If a drawing figure is to be canceled, the appropriate figure 
must be removed from the replacement sheet, and where necessary, the remaining figures must 
be renumbered and appropriate changes made to the brief description of the several views of the 
drawings for consistency. Additional replacement sheets may be necessary to show the 
renumbering of the remaining figures. The replacement sheet(s) should be labeled "Replacement 
Sheet" in the page header (as per 37 CFR 1.84(c)) so as not to obstruct any portion of the
drawing figures. If the changes are not accepted by the examiner, the applicant will be notified 
and informed of any required corrective action in the next Office action. The objection to the 
drawings will not be held in abeyance. 

Claim Objections 

5. Claim 2 is objected to because of the following informalities: Replace "comprise of"
with --comprise--. Appropriate correction is required.

6. Claim 4 is objected to because of the following informalities: Replace "having" with
--store--. Appropriate correction is required.

7. Claim 9 is objected to because of the following informalities: Insert --wherein-- before
"the trace buffer" and delete "having". Appropriate correction is required.

8. Claim 32 is objected to because of the following informalities: Replace "having" with
--store--. Appropriate correction is required.

Claim Rejections - 35 USC § 103

9. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

10. Claims 1-7, 9-13, 17-19, 27, and 29-35 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Sundaramoorthy et al., "Slipstream Processors: Improving both Performance
and Fault Tolerance," ASPLOS, Nov. 2000 (as applied in the previous Office Action and herein
referred to as Sundaramoorthy) in view of Hennessy and Patterson, "Computer Architecture - A
Quantitative Approach, 2nd Edition," 1996 (herein referred to as Hennessy).

11. Referring to claim 1, Sundaramoorthy has taught an apparatus comprising: 

a) a first processor and a second processor (fig. 1, R-stream and A-stream processors. Each 
processor comprising the execute core); 

b) a plurality of memory devices coupled to the first processor and the second processor (fig. 1, 
I-cache and D-cache memories); 

c) a register buffer coupled to the first processor and the second processor (fig. 1, the delay
buffer serves as a register buffer when an IR-misprediction is detected because the register
values are copied from the R-stream processor to the A-stream processor via the delay buffer [as
shown by the dotted lines in fig. 1 and col. 12, lines 7-12]);

d) a trace buffer coupled to the first processor and the second processor (fig. 1; col. 10, lines 17-19;
col. 11, lines 7-17: the delay buffer during normal operation serves as a trace buffer because
the control and data information of instructions traced is passed from the A-stream to the
R-stream processor); and

e) a plurality of memory instruction buffers coupled to the first processor and the second 
processor (fig. 1, a separate reorder buffer is connected to each processor); 

f) wherein the first processor and the second processor perform single threaded applications 
using multithreading resources (col. 1, lines 53-54, col. 2, lines 18-20: teaches that a single
thread is instantiated twice to create two threads and each thread is run on different processors). 

g) Sundaramoorthy has not taught that the first and second processors each have a scoreboard 
and a decoder. 

However, Official Notice is taken that instruction decoders are well known and expected 
in the art. More specifically, after instructions are fetched by a processor, they must inherently 
be decoded so that the processor may determine what type of instruction has been fetched and 
consequently, what operation to perform. Clearly, if both processors of Sundaramoorthy are 
fetching instructions, then both sets must be decoded. As a result, it would have been obvious to 
have instruction decoders in each of the first and second processors so that instructions may be 
decoded. One would have been motivated to make such a modification to allow both processors 
to decode their own instructions. 

In addition, Hennessy has taught that a scoreboard allows instructions to execute out of 
order. As is known in the art, out-of-order execution is advantageous because it allows 
instructions to execute as soon as their resources are ready, thereby reducing stalling and CPU 
idleness. See pages 241 and 242. As a result, in order to allow both processors to benefit from 
such execution and resulting advantages, it would have been obvious to one of ordinary skill in
the art at the time of the invention to modify each of the first and second processors of
Sundaramoorthy to include scoreboards.

12. Referring to claim 2, Sundaramoorthy in view of Hennessy has taught an apparatus as
described in claim 1. Sundaramoorthy has further taught that the memory devices comprise a
plurality of cache devices (Fig. 1, I-Cache and D-Cache).

13. Referring to claim 3, Sundaramoorthy in view of Hennessy has taught an apparatus as
described in claim 1. Sundaramoorthy has further taught that the first processor is coupled to at
least one of a plurality of zero level (L0) data cache devices and at least one of a plurality of L0
instruction cache devices, and the second processor is coupled to at least one of the plurality of
L0 data cache devices and at least one of the plurality of L0 instruction cache devices (fig. 1
shows that each processor is connected to a separate data cache (D-Cache) and instruction cache
(I-Cache) which can be considered as zero-level caches because they are directly connected to the
execute cores).

14. Referring to claim 4, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 3. Sundaramoorthy has further taught that each of the plurality of L0 data
cache devices have exact copies of store instruction data. Although this is not mentioned 
explicitly, it is deemed inherent to the design because as each processor is executing the same 
thread (col. 1, lines 53-54, col. 2, lines 18-20) the data caches in each processor must contain 
exact copies of data. And, this data is store instruction data because data that is stored to main 
memory is also stored in a data cache. 

15. Referring to claim 5, Sundaramoorthy in view of Hennessy has taught an apparatus as
described in claim 1. Sundaramoorthy has further taught that the plurality of memory
instruction buffers includes at least one store forwarding buffer (fig. 1, reorder buffer connected
to A-stream processor) and at least one load-ordering buffer (fig. 1, reorder buffer connected to
R-stream processor).

16. Referring to claim 6, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 5. Although Sundaramoorthy in view of Hennessy does not mention that the 
at least one store forwarding buffer (fig. 1, reorder buffer (ROB) connected to A-stream 
processor) comprises a structure having a plurality of entries, each of the plurality of entries 
having a tag portion, a validity portion, a data portion, a store instruction identification (ID) 
portion, and a thread ID portion it is deemed inherent to the design. A ROB is used to order 
instructions completing execution hence must contain a plurality of entries. Also each entry must 
have a tag portion to index into the ROB, a validity portion to indicate whether an entry can be 
written to or read from, a data portion for storing the results of the instruction, a store instruction 
ID portion would be the instruction opcode of an entry, and a thread ID for indicating which 
thread that instruction belongs to. 
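
For illustration only, the entry layout recited above (tag, validity, data, store instruction ID, and thread ID portions) can be sketched as follows. The names `StoreEntry` and `forward` are hypothetical, and the matching rule shown (youngest valid entry with the same tag and thread ID) is an assumption made for the sketch; neither reference is asserted to implement it this way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    """One buffer entry with the five recited portions (semantics assumed)."""
    tag: int        # value compared against a later lookup
    valid: bool     # whether the entry may be read
    data: int       # result held by the entry
    store_id: int   # identifies the store instruction
    thread_id: int  # identifies the owning thread


def forward(entries: List[StoreEntry], tag: int, thread_id: int) -> Optional[int]:
    """Return the data of the youngest valid matching entry, or None."""
    # Entries are assumed to be kept oldest-first, so scan from the end.
    for entry in reversed(entries):
        if entry.valid and entry.tag == tag and entry.thread_id == thread_id:
            return entry.data
    return None
```

Under these assumptions, an invalid entry is never forwarded even when its tag matches, which is the role the validity portion plays in the reasoning above.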

17. Referring to claim 7, Sundaramoorthy in view of Hennessy has taught an apparatus as
described in claim 6. Although Sundaramoorthy in view of Hennessy does not mention that the
at least one load ordering buffer (fig. 1, reorder buffer connected to R-stream processor)
comprises a structure having a plurality of entries, each of the plurality of entries having a tag
portion, an entry validity portion, a load identification (ID) portion, and a load thread ID portion,
it is deemed inherent to the design. A ROB is used to order instructions completing execution
and hence must contain a plurality of entries. Also, each entry must have a tag portion to index into
the ROB, a validity portion to indicate whether an entry can be written to or read from, a load
instruction ID portion, which would be the instruction opcode of an entry, and a thread ID portion
for indicating which thread that instruction belongs to.

18. Referring to claim 9, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 1. Furthermore, although Sundaramoorthy has taught that the trace buffer 
(delay buffer) is a FIFO queue (col. 10, line 17), they do not disclose that the trace buffer is a 
circular buffer having an array with head and tail pointers, the head and tail pointers having a 
wrap-around bit. However, "Official Notice" is taken that it is well known and expected in the 
art to implement a FIFO queue as a circular buffer with head and tail pointers wherein head and 
tail pointers have a wrap-around bit. A circular buffer is useful to implement in hardware
because only the head and tail pointers need to be incremented/decremented instead of actually
physically shifting entries. A wrap-around bit would also be needed to indicate whether the
pointer has wrapped around the end of the queue. Therefore, it would have been obvious to one of
ordinary skill in the art at the time of the invention to have implemented the FIFO queue as a
circular buffer with head and tail pointers, the head and tail pointers having a wrap around bit 
because it is known that a FIFO queue can be implemented as a circular buffer and it is easier to 
build in hardware. 
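
The circular-buffer arrangement described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the design of either reference: the class name `CircularFifo` and its method names are hypothetical, and the wrap-around bit is modeled as a one-bit flag that toggles each time a pointer passes the end of the array.

```python
class CircularFifo:
    """FIFO over a fixed array; each pointer carries a wrap-around bit."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0        # index of the oldest entry
        self.tail = 0        # index of the next free slot
        self.head_wrap = 0   # toggles when head passes the end of the array
        self.tail_wrap = 0   # toggles when tail passes the end of the array

    def is_empty(self):
        # Equal pointers with equal wrap bits: nothing is stored.
        return self.head == self.tail and self.head_wrap == self.tail_wrap

    def is_full(self):
        # Equal pointers with differing wrap bits: tail has lapped head.
        return self.head == self.tail and self.head_wrap != self.tail_wrap

    def push(self, item):
        if self.is_full():
            raise OverflowError("FIFO full")
        self.slots[self.tail] = item
        self.tail += 1
        if self.tail == self.size:
            self.tail = 0
            self.tail_wrap ^= 1

    def pop(self):
        if self.is_empty():
            raise IndexError("FIFO empty")
        item = self.slots[self.head]
        self.head += 1
        if self.head == self.size:
            self.head = 0
            self.head_wrap ^= 1
        return item
```

In this sketch no entry is ever shifted; only the pointers move, and the wrap-around bits are what distinguish a full queue from an empty one when the head and tail indices coincide.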

19. Referring to claim 10, Sundaramoorthy in view of Hennessy has taught an apparatus as
described in claim 1. Sundaramoorthy in view of Hennessy has not explicitly taught that the
register buffer comprises an integer register buffer and a predicate register buffer. However,
Official Notice is taken that integer registers and predicate registers are well known and expected
in the art. By implementing integer registers, the system will be able to load and store integer
data and perform integer operations quickly. Furthermore, by implementing predicate registers,
the system will be able to achieve conditional execution of instructions without conditional
branch instructions. Consequently, to achieve such functionality, it would have been obvious to
one of ordinary skill in the art at the time of the invention to modify Sundaramoorthy in view of
Hennessy to include an integer register buffer and a predicate register buffer in the register buffer
(delay buffer).

20. Referring to claim 11, Sundaramoorthy has taught a method comprising:

a) executing a plurality of instructions in a first thread by a first processor (col. 1, lines 53-54,
col. 2, lines 18-23: The R-stream thread is executed by the R-stream processor in fig. 1. The
R-stream processor comprises the execute core and the IR-detector and IR-predictor).

b) executing the plurality of instructions in the first thread by a second processor (col. 1, lines
53-54, col. 2, lines 18-32: The A-stream thread, which is the same as the R-stream thread, is
executed by the A-stream processor) as directed by the first processor (col. 4, lines 21-38: IR- 
detector and IR-predictor in fig. 1, which are part of the first processor i.e. R-stream processor, 
direct the second processor (A-stream processor) to execute instructions from the A-stream), the 
second processor executing the plurality of instructions ahead of the first processor (col. 2, lines 
20-23: A-stream runs ahead of the R-stream and it is executed by the second processor (A-stream 
processor)). 

c) Sundaramoorthy has not taught tracking at least one register that is one of loaded from a
register file buffer, and written by said second processor, said tracking executed by said second
processor. However, Hennessy has taught the idea of a scoreboard which allows instructions to
execute out of order. As is known in the art, out-of-order execution is advantageous because it
allows instructions to execute as soon as their resources are ready, thereby reducing stalling and
CPU idleness. See pages 241 and 242. As a result, in order to allow the second processor to
benefit from such execution and resulting advantages, it would have been obvious to one of
ordinary skill in the art at the time of the invention to modify the second processor of
Sundaramoorthy to include a scoreboard. And, the inherent nature of a scoreboard is to track
registers written by the second processor. See Fig. 4.4 on page 247, and note that the system
tracks when registers are ready so that execution may continue. For registers to be ready, it must
be tracked when the writing to those registers completes.
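
The register-tracking behavior relied upon above can be illustrated with a deliberately simplified sketch. It is not a reproduction of the scoreboard in Hennessy: it keeps only one pending-write flag per register, and the class name `RegisterScoreboard` and its methods are hypothetical names chosen for the illustration.

```python
class RegisterScoreboard:
    """Tracks, per register, whether a write is still outstanding."""

    def __init__(self, num_regs):
        # pending[r] is True from issue of a writer of r until its write-back.
        self.pending = [False] * num_regs

    def can_issue(self, srcs, dest):
        # Sources must not be awaiting a write, and the destination must
        # not already have an outstanding write.
        sources_ready = not any(self.pending[r] for r in srcs)
        return sources_ready and not self.pending[dest]

    def issue(self, srcs, dest):
        if not self.can_issue(srcs, dest):
            return False
        self.pending[dest] = True
        return True

    def write_back(self, dest):
        # The write has completed, so the register is ready again.
        self.pending[dest] = False
```

In this model an instruction that reads a register cannot issue until `write_back` has been recorded for that register, which corresponds to the point made above that register readiness requires tracking when writes complete.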

21. Referring to claim 12, Sundaramoorthy in view of Hennessy has taught a method as
described in claim 11. Sundaramoorthy has further taught:

a) transmitting control flow information from the second processor to the first processor, the first
processor avoiding branch prediction by receiving the control flow information (col. 10, lines
17-21, 30-35, 43-46); and

b) transmitting results from the second processor to the first processor, the first processor
avoiding executing a portion of instructions (col. 10, lines 17-21, 30-33, 35-38: results (data-flow
information) are transmitted from the A-stream processor to the R-stream processor via the delay 
buffer, and these values are used directly by the instructions hence avoiding the execution of the 
portion of the instructions) by committing the results of the portion of instructions into a register 
file from a trace buffer (Although this is not explicitly mentioned, it is deemed inherent to the 
design because col. 4 line 15 discloses the presence of a register file in the processor and as 
results are written to the register file so that they can be read from by future instructions, the 
results of the instructions from the trace buffer (delay buffer) must be written into the register 
file). 



Application/Control Number: 09/896,526 Page 11

Art Unit: 2183 

22. Referring to claim 13, Sundaramoorthy in view of Hennessy has taught a method as 
described in claim 12. Sundaramoorthy has further taught duplicating memory information in 
separate memory devices for independent access by the first processor and the second processor. 
Although this is not mentioned explicitly, it is deemed inherent to the design because as each 
processor is executing the same thread (col. 1, lines 53-54, col. 2, lines 18-20) the instruction and
data caches in each processor (fig. 1) must contain exact copies of instructions and data. 

23. Referring to claim 17, Sundaramoorthy in view of Hennessy has taught a method as 
described in claim 12. Sundaramoorthy has further taught executing a replay mode at a first 
instruction of a speculative thread (col. 1, lines 53-54, col. 2, lines 18-20: This feature is deemed 
inherent to the reference because when the A-stream is initially started, i.e., at the first 
instruction, there will be two redundant threads being executed which means the thread is being 
replayed from that point. This can be called a replay mode). 

24. Referring to claim 18, Sundaramoorthy in view of Hennessy has taught a method as 
described in claim 12. Sundaramoorthy has further taught: 

a) issuing all instructions up to a next replayed instruction including dependent instructions (This 
feature is deemed inherent to the design because in order to execute the thread all instructions are 
issued in either one of the R-stream and A-stream processors). 

b) issuing instructions that are not replayed as no-operation (NOPs) instructions (This feature is
also deemed inherent to the design because if an instruction that is not replayed does not occupy
a slot in the execution pipeline it will lead to improper functioning of the processor. Hence as the
instruction that is not replayed is not to be executed, a NOP must be issued in its place).

c) issuing all load instructions and store instructions to memory (This limitation is also deemed
inherent to the design because if all loads and stores are not issued to memory, the state of the
thread would be incorrect leading to the malfunctioning of the system).

d) committing non-replayed instructions from the trace buffer to the register file (Although this is 
not explicitly mentioned, it is deemed inherent to the design because col. 4 line 15 discloses the 
presence of a register file in the processor and as results are written to the register file so that 
they can be read from by future instructions, the results of the instructions from the trace buffer 
(delay buffer) that are not going to be replayed must be written into the register file). 

e) Sundaramoorthy in view of Hennessy has not taught supplying names from the trace buffer to 
preclude register renaming. However, Hennessy and Patterson teach that register renaming is 
used to reduce name dependencies allowing instructions involved in name dependencies to 
execute simultaneously or be reordered (pg. 232, para. 5). As these dependencies are resolved, 
more instruction level parallelism can be extracted and performance can be improved. One of 
ordinary skill in the art would have recognized the benefit of using register renaming in the
Sundaramoorthy reference because it too would improve performance. As the trace buffer (delay buffer) would
also supply the names, it would be logical not to do the renaming again in the R-stream 
processor. Therefore, it would have been obvious to one of ordinary skill in the art at the time of 
the invention to have modified the Sundaramoorthy reference by adding register renaming 
capabilities and supply names from the trace buffer to preclude register renaming. One would 
have been motivated to do so because it would improve performance which is one of the 
objectives of the Sundaramoorthy reference (col. 1, lines 29-36).

25. Referring to claim 19, Sundaramoorthy in view of Hennessy has taught a method as
described in claim 12. Sundaramoorthy has further taught clearing a valid bit in an entry in a
load buffer (fig. 1, the reorder buffer connected to the R-stream processor) if the load entry is
retired (Although not explicitly mentioned, it is deemed inherent to the design because a load
entry, on being retired, has to be marked invalid to ensure that other new instructions can occupy
that entry safely).

26. Referring to claim 27, Sundaramoorthy in view of Hennessy has taught a method as 
described in claim 21. Sundaramoorthy has further taught:

a) issuing all instructions up to a next replayed instruction including dependent instructions (This 
feature is deemed inherent to the design because in order to execute the thread all instructions are 
issued in either one of the R-stream and A-stream processors). 

b) issuing instructions that are not replayed as no-operation (NOPs) instructions (This feature is 
also deemed inherent to the design because if an instruction that is not replayed does not occupy 
a slot in the execution pipeline it will lead to improper functioning of the processor. Hence as the 
instruction that is not replayed is not to be executed, a NOP must be issued in its place). 

c) issuing all load instructions and store instructions to memory (This limitation is also deemed 
inherent to the design because if all loads and stores are not issued to memory, the state of the 
thread would be incorrect leading to the malfunctioning of the system). 

d) committing non-replayed instructions from the trace buffer to the register file (Although this is
not explicitly mentioned, it is deemed inherent to the design because col. 4 line 15 discloses the
presence of a register file in the processor and as results are written to the register file so that
they can be read from by future instructions, the results of the instructions from the trace buffer
(delay buffer) that are not going to be replayed must be written into the register file).

e) Sundaramoorthy in view of Hennessy has not taught supplying names from the trace buffer to
preclude register renaming. However, Hennessy and Patterson teach that register renaming is 
used to reduce name dependencies allowing instructions involved in name dependencies to 
execute simultaneously or be reordered (pg. 232, para. 5). As these dependencies are resolved, 
more instruction level parallelism can be extracted and performance can be improved. One of 
ordinary skill in the art would have recognized the benefit of using register renaming in the
Sundaramoorthy reference because it too would improve performance. As the trace buffer (delay buffer) would
also supply the names, it would be logical not to do the renaming again in the R-stream 
processor. Therefore, it would have been obvious to one of ordinary skill in the art at the time of 
the invention to have modified the Sundaramoorthy reference by adding register renaming 
capabilities and supply names from the trace buffer to preclude register renaming. One would 
have been motivated to do so because it would improve performance which is one of the 
objectives of the Sundaramoorthy reference (col. 1, lines 29-36).

27. Referring to claim 29, Sundaramoorthy has taught a system comprising:

a) a first processor (fig. 1, R-stream processor comprising the execute core);

b) a second processor (fig. 1, A-stream processor comprising the execute core);

c) a bus coupled to the first processor and the second processor (fig. 1, a bus is shown between 
the first and second processors via the delay buffer); 

d) a plurality of local memory devices coupled to the first processor and the second processor
(fig. 1, I-cache and D-cache memories);

e) a register buffer coupled to the first processor and the second processor (fig. 1, the delay
buffer serves as a register buffer when an IR-misprediction is detected because the register
values are copied from the R-stream processor to the A-stream processor via the delay buffer [as
shown by the dotted lines in fig. 1 and col. 12, lines 7-12]);

f) a trace buffer coupled to the first processor and the second processor (fig. 1; col. 10, lines 17-19;
col. 11, lines 7-17: the delay buffer during normal operation serves as a trace buffer because
the control and data information of instructions traced is passed from the A-stream to the
R-stream processor); and

g) a plurality of memory instruction buffers coupled to the first processor and the second 
processor (fig. 1, a separate reorder buffer is connected to each processor); 

h) wherein the first processor and the second processor perform single threaded applications 
using multithreading resources (col. 1, lines 53-54, col. 2, lines 18-20: teaches that a single 
thread is instantiated twice to create two threads and each thread is run on different processors). 

i) Sundaramoorthy has not taught that the first and second processors each have a scoreboard and 
a decoder. 

However, Official Notice is taken that instruction decoders are well known and expected 
in the art. More specifically, after instructions are fetched by a processor, they must inherently 
be decoded so that the processor may determine what type of instruction has been fetched and 
consequently, what operation to perform. Clearly, if both processors of Sundaramoorthy are
fetching instructions, then both sets must be decoded. As a result, it would have been obvious to
have instruction decoders in each of the first and second processors so that instructions may be
decoded. One would have been motivated to make such a modification to allow both processors
to decode their own instructions.

In addition, Hennessy has taught that a scoreboard allows instructions to execute out of 
order. As is known in the art, out-of-order execution is advantageous because it allows 
instructions to execute as soon as their resources are ready, thereby reducing stalling and CPU 
idleness. See pages 241 and 242. As a result, in order to allow both processors to benefit from 
such execution and resulting advantages, it would have been obvious to one of ordinary skill in 
the art at the time of the invention to modify each of the first and second processors of 
Sundaramoorthy to include scoreboards. 

j) Sundaramoorthy also has not taught a main memory coupled to the bus. However, Official 
Notice is taken that it is well known and expected in the art to have a main memory connected to 
multiple processors via a common bus in a multi-processor environment. Since caches do not 
store every instruction and data item, main memory must exist to store all of it. Therefore it 
would have been obvious to one of ordinary skill in the art at the time of the invention to have 
added a main memory coupled to the bus in the Sundaramoorthy reference. 

28. Referring to claim 30, Sundaramoorthy in view of Hennessy has taught a system as 
described in claim 29, wherein the memory devices comprise of a plurality of cache devices
(Fig. 1, I-Cache and D-Cache).

29. Referring to claim 31, Sundaramoorthy in view of Hennessy has taught a system as
described in claim 29, wherein the first processor is coupled to at least one of a plurality of zero
level (L0) data cache devices and at least one of a plurality of L0 instruction cache devices, and
the second processor is coupled to at least one of the plurality of L0 data cache devices and at
least one of the plurality of L0 instruction cache devices (fig. 1 shows that each processor is
connected to a separate data cache (D-Cache) and instruction cache (I-Cache) which can be considered
as zero-level caches because they are directly connected to the execute cores).

30. Referring to claim 32, Sundaramoorthy in view of Hennessy has taught a system as
described in claim 31, wherein each of the plurality of L0 data cache devices having exact copies 
of store instruction data. Although this is not mentioned explicitly, it is deemed inherent to the 
design because as each processor is executing the same thread (col. 1, lines 53-54, col. 2, lines 
18-20) the instruction and data caches in each processor must contain exact copies of instructions 
and data. And, this data is store instruction data because data that is stored to main memory is 
also stored in a data cache. 

31. Referring to claim 33, Sundaramoorthy in view of Hennessy has taught a system as
described in claim 31. Sundaramoorthy in view of Hennessy has not taught that the first
processor and the second processor each share a first level (L1) cache device and a second
level (L2) cache device. However, Official Notice is taken that it is well known and expected in
the art that processors in a multi-processor environment share L1 and L2 cache devices. Such a
scheme allows for the simplification of cache coherency in that both processors would be able to
access the same up-to-date cache as opposed to one of the processors accessing out-of-date
information in its own cache. Therefore, it would have been obvious to one of ordinary skill in
the art at the time of the invention to have the first and second processors share L1 and L2 cache
devices.

32. Referring to claim 34, Sundaramoorthy in view of Hennessy has taught a system as
described in claim 29. Sundaramoorthy has further taught that the plurality of memory
instruction buffers includes at least one store forwarding buffer (fig. 1, reorder buffer connected
to A-stream processor) and at least one load-ordering buffer (fig. 1, reorder buffer connected to
R-stream processor).

33. Referring to claim 35, Sundaramoorthy in view of Hennessy has taught a system as
described in claim 34. Although Sundaramoorthy in view of Hennessy does not mention that the
at least one store forwarding buffer (fig. 1, reorder buffer (ROB) connected to A-stream processor)
comprises a structure having a plurality of entries, each of the plurality of entries having a tag
portion, a validity portion, a data portion, a store instruction identification (ID) portion, and a
thread ID portion, it is deemed inherent to the design. A ROB is used to order instructions
completing execution and hence must contain a plurality of entries. Also, each entry must have a tag
portion to index into the ROB, a validity portion to indicate whether an entry can be written to or
read from, a data portion for storing the results of the instruction, a store instruction ID portion,
which would be the instruction opcode of an entry, and a thread ID portion for indicating which
thread that instruction belongs to.



34. Claims 14-16 and 23-25 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Sundaramoorthy in view of Hennessy, as applied above, and further in view of Akkary, WO 
99/31594 (as applied in the previous Office Action). 

35. Referring to claim 14, Sundaramoorthy has taught a method as described in claim 12. 
Sundaramoorthy in view of Hennessy has not taught clearing a store validity bit and setting a 
mispredicted bit in a load entry in the trace buffer if a replayed store instruction has a matching 
store identification (ID) portion. However, Official Notice is taken that it is well known and 
expected in the art to use load and store buffers for the proper handling of memory operations. 
Akkary discloses a system for ordering loads and stores in a multithreaded processor using load 
and store buffers (fig. 2, 182,184). He discloses clearing a store validity bit (SB Hit field) in the 
load buffer if data came from memory (pg. 37, para. 3, line 4; pg. 38, line 1). Also when a store 
instruction is executed (which includes replayed stores), its address is compared with the store 
ID portion (addresses) of load instructions (pg. 36, para. 3). On a match, a replay event is 
signaled to the load entry in the trace buffer to replay the load instruction and all its dependent 
instructions because it was mispredicted (pg. 38, para. 2). Furthermore, Official Notice is taken 
that it is well known and expected in the art to set a status bit to indicate a misprediction. Clearly, 
in order to detect a misprediction, some bit must change somewhere in the system. As shown in 
In re Larson, 144 USPQ 347 (CCPA 1965), to make integral is generally not given patentable 
weight or would have been an obvious improvement. That is, it does not matter where this 
misprediction bit is located within the system, as long as it exists. One of ordinary skill in the art 
would have recognized that one could use the load and store buffer arrangement of Akkary in the 
Sundaramoorthy reference in order to handle loads and stores in the multithreaded environment. 
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the 
invention to have modified the Sundaramoorthy reference by clearing a store validity bit and 
setting a mispredicted bit in a load entry in the trace buffer (delay buffer) if a replayed store 
instruction has a matching store ID portion. 
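
The limitation at issue in claim 14 can be pictured with a short sketch. It is an illustrative assumption rather than the circuit of Akkary or of the application: LoadEntry, on_replayed_store, and the integer store ID encoding are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadEntry:
    """A load entry in a trace buffer (hypothetical layout)."""
    store_id: int               # ID of the store the load took its data from
    store_valid: bool = True    # store validity bit
    mispredicted: bool = False  # mispredicted bit


def on_replayed_store(load_entries: List[LoadEntry], replayed_store_id: int) -> int:
    """Clear the store validity bit and set the mispredicted bit on every matching load entry.

    Returns the number of load entries that matched.
    """
    hits = 0
    for entry in load_entries:
        if entry.store_id == replayed_store_id:
            entry.store_valid = False
            entry.mispredicted = True
            hits += 1
    return hits
```

Entries whose store ID does not match are left untouched, so only loads that depended on the replayed store are marked for replay.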

36. Referring to claim 15, Sundaramoorthy in view of Hennessy has taught a method as 
described in claim 12. Sundaramoorthy in view of Hennessy has not taught setting a store 
validity bit if a store instruction that is not replayed matches a store identification (ID) portion. 
However, Official Notice is taken that it is well known and expected in the art to use load and 
store buffers for the proper handling of memory operations. Akkary discloses a system for 
ordering loads and stores in a multithreaded processor using load and store buffers (fig. 2, 
182,184). He discloses setting a store validity bit (SB Hit field) in the load buffer if data came 
from the store buffer (pg. 37, para. 3, line 4; pg. 38, lines 1-2). In order for data to come from the 
store buffer, a store instruction address (including store instructions that are not replayed) must 
match a store ID portion (address) of the load entry. One of ordinary skill in the art would have 
recognized that one could use the load and store buffer arrangement of Akkary in the 
Sundaramoorthy reference in order to handle loads and stores in the multithreaded environment. 
Therefore it would have been obvious to one of ordinary skill in the art at the time of the 
invention to have modified the Sundaramoorthy reference by setting a store validity bit if a store 
instruction that is not replayed matches the store ID portion. 
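
As a rough illustration of the SB Hit behavior described above (set when the store buffer supplies the load data, clear when memory supplies it), consider the following sketch. The function execute_load, the dictionary-based buffers, and the default value of 0 for unwritten memory are assumptions made for the example and are not drawn from Akkary.

```python
from typing import Dict, Tuple


def execute_load(address: int,
                 store_buffer: Dict[int, int],
                 memory: Dict[int, int]) -> Tuple[int, bool]:
    """Return (data, sb_hit); sb_hit is True when the store buffer supplied the data."""
    if address in store_buffer:
        # A buffered store matches the load address: forward its data, set SB Hit.
        return store_buffer[address], True
    # No matching store: read memory (unwritten locations assumed to read as 0), clear SB Hit.
    return memory.get(address, 0), False
```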

37. Referring to claim 16, Sundaramoorthy in view of Hennessy has taught a method as 
described in claim 12. Furthermore, although Sundaramoorthy has taught flushing the pipeline 
(reorder buffer) of the R-stream on a misprediction, Sundaramoorthy has not taught flushing a 
pipeline, setting a mispredicted bit in a load entry in the trace buffer and restarting a load 
instruction if one of the load is not replayed and does not match a tag portion in a load buffer, 
and the load instruction matches the tag portion in the load buffer while a store valid bit is not 
set. However, Official Notice is taken that it is well known and expected in the art to use load 
and store buffers for the proper handling of memory operations. Akkary discloses a system for 
ordering loads and stores in a multithreaded processor using load and store buffers (fig. 2, 
182,184). In particular, when a store valid bit is not set (SB hit = 0, pg. 38, para. 2) and when a 
store instruction compared with the addresses of load instructions (pg. 36, para. 3) is a match, a 
replay event is signaled to the load entry in the trace buffer to replay the load instruction and all 
its dependent instructions because it was mispredicted (pg. 38, para. 2). Furthermore, Official 
Notice is taken that it is well known and expected in the art to set a status bit to indicate a 
misprediction. Clearly, in order to detect a misprediction, some bit must change somewhere in 
the system. As shown in In re Larson, 144 USPQ 347 (CCPA 1965), to make integral is 
generally not given patentable weight or would have been an obvious improvement. That is, it 
does not matter where this misprediction bit is located within the system, as long as it exists. 
One of ordinary skill in the art would have recognized that one could use the load and store 
buffer arrangement of Akkary in the Sundaramoorthy reference in order to handle loads and stores 
in the multithreaded environment and flush the pipeline on reading the mispredicted bit. 
Therefore it would have been obvious to one of ordinary skill in the art at the time of the invention 
to have modified the Sundaramoorthy reference by flushing a pipeline, setting a mispredicted bit 
in a load entry in the trace buffer and restarting a load instruction if one of the load is not 
replayed and does not match a tag portion in a load buffer, and the load instruction matches the 
tag portion in the load buffer while a store valid bit is not set. 

38. Claims 20-22, 26, and 28 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Sundaramoorthy in view of Hennessy, as applied above, and further in view of Tanenbaum, 
"Structured Computer Organization," Prentice-Hall, 1984, pp. 10-12 (as applied in the previous 
Office Action and herein referred to as Tanenbaum). 

39. Referring to claim 20, Sundaramoorthy has taught operations comprising: 

a) executing a plurality of instructions in a first thread by a first processor (col. 1, lines 53-54, 
col. 2, lines 18-23: The R-stream thread is executed by the R-stream processor in fig. 1). 

b) executing the plurality of instructions in the first thread by a second processor (col. 1, lines 
53-54, col. 2, lines 18-32: The A-stream thread, which is the same as the R-stream thread, is 
executed by the A-stream processor) as directed by the first processor (col. 4, lines 21-38: 
IR-detector and IR-predictor in fig. 1, considered part of the first processor i.e. R-stream processor, 
direct the second processor (A-stream processor) to execute instructions from the A-stream), the 
second processor executing the plurality of instructions ahead of the first processor (col. 2, lines 
20-23: A-stream runs ahead of the R-stream and it is executed by the second processor (A-stream 
processor)). 

c) Sundaramoorthy has not taught tracking at least one register that is one of loaded from a 
register file buffer, and written by said second processor, said tracking executed by said second 
processor. However, Hennessy has taught the idea of a scoreboard which allows instructions to 
execute out of order. As is known in the art, out-of-order execution is advantageous because it 
allows instructions to execute as soon as their resources are ready, thereby reducing stalling and 
CPU idleness. See pages 241 and 242. As a result, in order to allow the second processor to 
benefit from such execution and resulting advantages, it would have been obvious to one of 
ordinary skill in the art at the time of the invention to modify the second processor of 
Sundaramoorthy to include a scoreboard. And, the inherent nature of a scoreboard is to track 
registers written by the second processor. See Fig.4.4 on page 247, and note that the system 
tracks when registers are ready so that execution may continue. For registers to be ready, it must 
be tracked when the writing to those registers completes. 

d) Sundaramoorthy in view of Hennessy does not disclose an apparatus comprising a 
machine-readable medium containing instructions which, when executed by a machine, cause the machine to perform the 
aforementioned operations. However, Tanenbaum has taught that any instruction executed by 
hardware can also be simulated in software (pg. 11, para. 4, lines 1-2). He also teaches that 
hardware is generally immutable (first para. after sec. 1.4 header) while software allows for more 
rapid change (pg. 11, para. 4, lines 2-4). One of ordinary skill in the art at the time of the 
invention would have been motivated to convert the Sundaramoorthy reference to software i.e. 
instructions on a machine readable medium because Tanenbaum teaches that hardware is 
generally immutable (first para. after sec. 1.4 header) while software allows for more rapid 
change (pg. 11, para. 4, lines 2-4). Therefore, to allow for ease of correction of mistakes, and/or 
an ease of addition of new functionality, it would have been obvious to one of ordinary skill in 
the art to have implemented the method of Sundaramoorthy by an apparatus comprising 
instructions recorded on a machine readable medium. 
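
The register tracking attributed to a scoreboard in item c) above can be sketched as follows. This is a deliberately simplified model offered under stated assumptions: the class Scoreboard, its method names, and the register names are hypothetical, functional-unit (structural) hazards are ignored, and it is not the scoreboard design shown in Hennessy.

```python
from typing import Iterable, Set


class Scoreboard:
    """Tracks which registers have a write outstanding (simplified; no functional units)."""

    def __init__(self) -> None:
        self.pending: Set[str] = set()  # registers whose write has not yet completed

    def can_issue(self, dest: str, sources: Iterable[str]) -> bool:
        # Hold the instruction if a source is still being written (RAW hazard)
        # or if the destination already has a write outstanding (WAW hazard).
        return dest not in self.pending and all(s not in self.pending for s in sources)

    def issue(self, dest: str, sources: Iterable[str]) -> bool:
        """Issue the instruction if possible and record its pending register write."""
        if not self.can_issue(dest, sources):
            return False
        self.pending.add(dest)
        return True

    def write_complete(self, dest: str) -> None:
        """Mark the register as ready once its write has completed."""
        self.pending.discard(dest)
```

An instruction that reads a register with an outstanding write is held back, while an independent instruction may proceed, which corresponds to the out-of-order behavior the rejection relies on.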

40. Referring to claim 21, Sundaramoorthy in view of Hennessy and further in view of 
Tanenbaum has taught an apparatus as described in claim 20. Sundaramoorthy has further taught 
transmitting control flow information from the second processor to the first processor, the first 
processor avoiding branch prediction by receiving the control flow information (col. 10, lines 
17-21, 30-35, 43-46). 

41. Referring to claim 22, Sundaramoorthy in view of Hennessy and further in view of 
Tanenbaum has taught an apparatus as described in claim 21. Sundaramoorthy has further taught 
duplicating memory information in separate memory devices for independent access by the first 
processor and the second processor (Although this is not mentioned explicitly, it is deemed 
inherent to the design because as each processor is executing the same thread (col. 1, lines 53-54, 
col. 2, lines 18-20) the instruction and data caches in each processor (fig. 1) must contain exact 
copies of instructions and data). 

42. Referring to claim 26, Sundaramoorthy in view of Hennessy and further in view of 
Tanenbaum has taught an apparatus as described in claim 21. Sundaramoorthy has further 
taught: 

a) executing a replay mode at a first instruction of a speculative thread (col. 1, lines 53-54, col. 2, 
lines 18-20; This feature is deemed inherent to the reference because when the A-stream is 
initially started i.e. at the first instruction, there will be two redundant threads being executed 
which means the thread is being replayed from that point. This can be called a replay mode). 

b) terminating the replay mode and the execution of the speculative thread if a partition in the 
trace buffer is approaching an empty state (this limitation is also deemed inherent to the 
reference because when the partition in the trace buffer (delay buffer, col. 10, lines 15+) is 
approaching an empty state that means the A-stream has stopped producing results and finished 
executing. Therefore now the replay mode and the A-stream are terminated). 

43. Referring to claim 28, Sundaramoorthy in view of Hennessy and further in view of 
Tanenbaum has taught an apparatus as described in claim 21. Sundaramoorthy has further taught 
clearing a valid bit in an entry in a load buffer (fig. 1, the reorder buffer connected to the 
R-stream processor) if the load entry is retired (Although not explicitly mentioned, it is deemed 
inherent to the design because a load entry, on being retired, has to be marked invalid to ensure 
that other new instructions can occupy that entry safely). 

44. Claims 23-25 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Sundaramoorthy in view of Hennessy in view of Tanenbaum, as applied above, and further in 
view of Akkary, as applied above. 

45. Referring to claims 23-25, Sundaramoorthy in view of Hennessy and further in 
view of Tanenbaum has taught an apparatus as described in claim 21. Furthermore, claims 23-25 
are rejected for the same reasons set forth in the rejections of claims 14-16, respectively. 

Response to Arguments 

46. Applicant's arguments filed on July 20, 2004, have been fully considered but they are not 
persuasive. 

47. Applicant argues the novelty/rejection of claims 14 and 23 on pages 21 and 25 of the 
remarks, respectively, in substance that: 

"It is asserted in the Office Action that Official Notice is taken that it is well 
known and expected to set a status bit to indicate the change of state. Applicant agrees. 
However, how a state is changed and which specific bits are set in order to effect actions 
is not well known or expected, and much narrower than the broad limitation of setting 
of a status bit to indicate state. Therefore, Applicant respectfully traverses the Official 
Notice." 

48. These arguments are not found persuasive for the following reasons: 

a) As discussed in the above rejection, a misprediction must be detected. If a misprediction is 
not detected then the program's execution would be corrupt as it is executing based on incorrect 
information. Consequently, at least one bit must change somewhere within the system to denote 
a misprediction. The examiner asserts that which bit is changed is an obvious modification 
because the bit can be located anywhere within the system as long as the processor knows where 
to access it. 

49. Applicant argues the novelty/rejection of claims 18 and 27 on pages 23 and 27 of the 
remarks, respectively, in substance that: 

"Moreover, the reason Hennessy and Patterson are relied on is not consistent with Applicant's 
claim language. Applicant's claim 18 contains the limitations of "supplying names from the trace 
buffer to preclude register renaming." Merriam-Webster online dictionary defines the meaning of 
"preclude" as "to make impossible by necessary consequence : rule out in advance." Therefore, 
while Applicant is supplying names from the trace buffer, Applicant is doing so to not rename 
registers. Therefore, Applicant respectfully asserts the Hennessy and Patterson disclosure does 
not teach, disclose or suggest precluding register renaming." 

50. These arguments are not found persuasive for the following reasons: 

a) The examiner agrees that Hennessy does not teach precluding register renaming. However, 
that is why the examiner has made a 103 rejection. In essence, the examiner stated that it would 
be obvious to perform renaming once for improving performance, and since register renaming is 
already done once, it would have been obvious to preclude register renaming because it is 
redundant or illogical to do it for a second time in the other processor because the data is already 
being passed through the trace buffer. Applicant's claim does not claim that register renaming is 
not used in the entire system and the examiner asserts that not performing register renaming in 
one of the processors due to the trace buffer supplying names, also reads on "precluding register 
renaming" (in the one of the processors). 

Conclusion 

51. Applicant's amendment necessitated the new ground(s) of rejection presented in this 
Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). 
Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within TWO 
MONTHS of the mailing date of this final action and the advisory action is not mailed until after 
the end of the THREE-MONTH shortened statutory period, then the shortened statutory period 
will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 
CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, 
however, will the statutory period for reply expire later than SIX MONTHS from the date of this 
final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's 
disclosure. Applicant is reminded that in amending in response to a rejection of claims, the 
patentable novelty must be clearly shown in view of the state of the art disclosed by the 
references cited and the objections made. Applicant must also show how the amendments avoid 
such references and objections. See 37 CFR § 1.111(c). 

Devereux et al, U.S. Patent No. 5,961,631, has taught a data processing apparatus and 
method for prefetching an instruction to an instruction cache. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to David J. Huisman whose telephone number is (703) 305-7811. 
The examiner can normally be reached on Monday-Friday (8:00-4:30). 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Eddie Chan, can be reached on (703) 305-9712. The fax phone number for the 
organization where this application or proceeding is assigned is 703-872-9306. 



Application/Control Number: 09/896,526 Page 28 

Art Unit: 2183 

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 




